Computational Literary Studies
A gentle introduction
28.04.2025 CUDAN Open Lab seminar
First, what CLS is about
- Computational Literary Studies
- Aimed at analyzing (large amounts of) textual data…
- … by computational techniques
Leon Battista Alberti
![]()
Leon Battista Alberti, De componendis cifris, ca. 1466
Computation into criticism
![]()
John Burrows, Computation into Criticism, 1987
Distant reading
![]()
Franco Moretti, Matt Jockers, Ted Underwood
Sociology of reading
![]()
Karina van Dalen-Oskam, Het raadsel literatuur, 2021
Quantitative linguistics
Foundations of CLS
- Computation into criticism
- Distant reading
- Stylometry
- Authorship attribution
- Digital humanities
- Language resources
- Digital libraries
- Natural language processing
- Machine learning
- Big data
- …
What CLS has to offer
- Scientific method
- reproducibility, empirical paradigm, statistical modeling,
probabilistic inference, …
- Scale
- access to unprecedented amounts of data
- Accuracy
- ability to capture patterns invisible to the naked eye
1,000 Polish novels
![]()
Combination of factors needed
- Datasets (language resources)
- Tools (computer programs)
- Suitable methodology
- Computing power (i.e. scientific instruments)
None of these suffices individually
Libraries, journals, publishers, …
![]()
Dictionaries at IJP PAN
![]()
ELTeC corpus
![]()
DraCor
![]()
CLS INFRA
An infrastructural project for computational literary studies,
funded under the Horizon 2020 scheme
infrastructures in DH and CLS
- in hard sciences, infrastructures are tangible
- servers, telescopes, accelerators, …
- in the humanities, institutions are essential
- libraries, publishing houses, journals, …
- in DH, multifaceted needs
- the notion of infrastructure needs reconsideration
- corpora (FAIR!) but not only
CLS INFRA project
- text collections (corpora)
- quality
- metadata
- conversion
- methodology
- tools (NLP, datavis, …)
- tool chains
- methodological considerations
- bibliographic survey
- network of scholars
- training schools
- short-term research stays
- collaboration with COST Action
Overarching idea is to connect…
- People
- To establish a network of CLS researchers
- Data
- To consolidate existing high-quality corpora…
- …covering prose, drama and poetry
- Tools
- To build a chain of NLP tools to analyze texts
- Methods
- To provide a survey of state-of-the-art methods
activities
- training schools
- Prague 2022, Madrid 2023, Vienna 2024
- workshops
- closing event
- transnational access fellowships
- short-term research stays…
- in one of 6 institutions:
selected deliverables
- 3.1 Report on the methodological baseline for (computational)
literary studies
- 4.1 Report on the skills matrix for computational literary
studies
- 5.1 Review of the data landscape
- 6.1 Assembly of existing data
survey of methods
![]()
CLS-centric Discord server
![]()
Why text analysis?
- Authorship attribution
- Forensic linguistics
- Register analysis
- Genre recognition
- Gender differences
- Translatorial signal
- Early vs. mature style
- Style evolution
- Detecting dementia
- …
stylometry
- measures stylistic differences between texts
- oftentimes aimed at authorship attribution
- relies on stylistic fingerprint, …
- … aka measurable linguistic features
- frequencies of function words
- frequencies of grammatical patterns, etc.
- proves successful in several applications
How to compare texts?
- Extracting valuable (i.e. countable) language features from texts
- frequencies of words 👈
- frequencies of syllables
- versification patterns
- grammatical patterns
- distribution of topics
- …
- Comparing these features by means of multivariate analysis
- distance-based methods 👈
- neural networks
- …
From words to features
‘It is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife.’ (J.
Austen, Pride and Prejudice)
“the” = 4.25%
“in” = 3.45%
“of” = 1.81%
“to” = 1.44%
“a” = 1.37%
“was” = 1.17%
. . .
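The percentages above are computed over the whole novel; as a minimal sketch of the same mechanism, the snippet below counts word tokens in the quoted sentence alone and divides by the total (the tokenizer regex is an illustrative simplification):

```python
from collections import Counter
import re

def relative_frequencies(text):
    """Count word tokens and divide each count by the total token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

sentence = ("It is a truth universally acknowledged, that a single man "
            "in possession of a good fortune, must be in want of a wife.")
freqs = relative_frequencies(sentence)
# in this single sentence, "a" occurs 4 times out of 23 tokens
print(f"a: {freqs['a']:.3f}, of: {freqs['of']:.3f}")
```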
From features to similarities
## the and to a of he i in
## Capote_Blood_1966 5.226 2.828 2.350 2.909 2.323 1.907 1.562 1.346
## Capote_Breakfast_1958 4.178 2.489 2.230 3.117 1.829 1.277 3.059 1.273
## Capote_Crossing_1950 4.079 2.848 2.119 2.998 2.134 1.528 1.061 1.428
## Capote_Harp_1951 4.597 2.462 2.326 3.030 1.950 1.103 2.253 1.390
## Capote_Voices_1948 4.661 3.201 1.808 3.146 1.727 1.868 1.237 1.481
## Faulkner_Absalom_1936 5.567 4.802 2.649 1.463 2.145 2.219 0.835 1.374
## Faulkner_Dying_1930 5.268 3.487 2.321 1.883 1.403 2.152 3.102 1.209
## Faulkner_Light_1832 5.611 3.660 2.350 1.940 1.611 3.482 1.325 1.365
## Faulkner_Moses_1942 6.077 4.910 2.275 1.526 1.819 2.685 0.836 1.366
## Faulkner_Sound_1929 4.391 3.046 2.631 1.812 1.259 1.427 3.902 1.167
## Glasgow_Phases_1898 6.065 3.357 1.983 2.625 3.071 2.203 1.639 1.632
## Glasgow_Vein_1935 5.153 2.708 2.389 2.249 2.051 1.563 1.651 1.697
## Glasgow_Virginia_1913 5.447 2.459 2.691 2.045 3.217 1.434 1.582 1.656
## HarperLee_Mockingbird_1960 3.936 2.336 2.451 1.855 1.528 1.934 2.559 1.369
## HarperLee_Watchman_2015 4.283 2.566 2.471 1.921 1.781 1.291 1.730 1.529
## McCullers_HeartIsaLo_1940 6.162 4.020 2.441 2.362 1.905 2.571 1.042 1.681
## McCullers_Member_1946 6.472 4.417 2.210 2.445 1.743 0.816 1.245 1.523
## McCullers_Reflection_1941 7.192 3.382 2.125 2.888 2.399 2.622 0.397 1.902
## OConnor_Everything_1956 5.597 3.198 2.578 2.461 1.965 2.939 0.948 1.700
What we hope to get

stylometry beyond attribution
![]()
areas of improvement
- classification method
- distance-based
- svm, nsc, knn, …
- neural networks
- …
- feature engineering
- dimension reduction
- lasso
- …
- feature choice
- MFWs
- POS n-grams
- character n-grams
- …
simple normalization
Occurrences of the most frequent words (MFWs):
##
## the and to i of a in was her it you he she that not my
## 4571 4748 3536 4130 2224 2326 1484 1127 1551 1391 1895 2138 1338 1250 937 1106
Relative frequencies:
##
## the and to i of a in was her it you
## 0.0383 0.0398 0.0296 0.0346 0.0186 0.0195 0.0124 0.0094 0.0130 0.0116 0.0159
relative frequencies
The number of occurrences of a given word divided by the total number
of words:
\[ f_\mathrm{the} = \frac{n_\mathrm{the}}{
n_\mathrm{the} + n_\mathrm{of} + n_\mathrm{and} + n_\mathrm{in} + ... }
\]
In a generalized version:
\[ f_{w} = \frac{n_{w}}{N} \]
relative frequencies
- routinely used
- reliable
- simple
- intuitive
- conceptually elegant
synonyms
Proportions within synonym groups might betray a stylistic
signal:
- on and upon
- drink and beverage
- buy and purchase
- big and large
- et and atque and ac
proportions within synonyms
The proportion of on to upon:
\[ f_\mathrm{on} = \frac{n_\mathrm{on}}{
n_\mathrm{on} + n_\mathrm{upon} } \]
The proportion of upon to on:
\[ f_\mathrm{upon} =
\frac{n_\mathrm{upon}}{ n_\mathrm{on} + n_\mathrm{upon} } \]
By definition, they sum to 1.
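The same proportion generalizes to any synonym group. A minimal sketch (the counts here are made up for illustration):

```python
def synonym_proportion(counts, target, group):
    """Frequency of `target` relative to its whole synonym group."""
    total = sum(counts.get(w, 0) for w in group)
    return counts.get(target, 0) / total if total else 0.0

# hypothetical occurrence counts from some text
counts = {"on": 120, "upon": 30}
f_on = synonym_proportion(counts, "on", ["on", "upon"])
f_upon = synonym_proportion(counts, "upon", ["on", "upon"])
print(f_on, f_upon)  # 0.8 0.2 -- the proportions sum to 1
```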
‘on’/total vs. ‘on’/(‘upon’ + ‘on’)

‘the’/total vs. ‘the’/(‘of’ + ‘the’)

limitations of synonyms
- in many cases, several synonyms
- cf. et and atque and ac in Latin
- in many cases, no synonyms at all
- target words might belong to different grammatical categories
- what are the synonyms for function words?
- provisional conclusion:
- synonyms are but a subset of the words that matter
semantic similarity
- target words: synonyms and more
- e.g. for the word make the target words can involve:
- perform, do, accomplish, finish,
reach, produce, …
- all their inflected forms (if applicable)
- derivative words: nouns, adjectives, e.g. a deed
- the size of a target semantic area is unknown
word vector models
- trained on a large amount of textual data
- capable of capturing (fuzzy) semantic relations between words
- many implementations:
- word2vec
- GloVe
- fastText
- …
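The neighbor lookups shown on the next slide boil down to ranking the vocabulary by cosine similarity to a query vector. A toy sketch of that mechanism (the 3-dimensional vectors below are invented; real models use 100–300 dimensions trained on large corpora):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# toy embeddings, hand-made for illustration only
vectors = {
    "house": [0.9, 0.1, 0.2],
    "room":  [0.8, 0.2, 0.3],
    "farm":  [0.7, 0.1, 0.4],
    "buy":   [0.1, 0.9, 0.1],
}

def neighbors(word, k=3):
    """The k vocabulary items most similar to `word` (the word itself ranks first)."""
    sims = {w: cosine(vectors[word], v) for w, v in vectors.items()}
    return sorted(sims.items(), key=lambda kv: -kv[1])[:k]

print(neighbors("house"))
```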
GloVe model: examples
the neighbors of house:
## house where place room town houses farm rooms the left
## 1.000 0.745 0.688 0.685 0.664 0.651 0.647 0.638 0.637 0.635
the neighbors of home:
## home return come coming going london back go went came
## 1.000 0.732 0.717 0.705 0.696 0.689 0.682 0.670 0.666 0.665
the neighbors of buy:
## buy sell wanted want sold get send wants give money
## 1.000 0.728 0.552 0.539 0.537 0.536 0.535 0.531 0.526 0.510
the neighbors of style:
## style quality fashion manner type taste manners language
## 1.000 0.597 0.565 0.560 0.547 0.527 0.518 0.512
## proper english
## 0.504 0.504
relative frequencies revisited
for a semantic space of the 2 nearest neighbors, the frequency of the
word house:
\[ f_\mathrm{house} =
\frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} +
n_\mathrm{place} } \]
for a semantic space of the 5 nearest neighbors, the frequency of the
word house:
\[ f_\mathrm{house} =
\frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} +
n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses}
} \]
for a semantic space of the 7 nearest neighbors, the frequency of the
word house:
\[ f_\mathrm{house} =
\frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} +
n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses}
+ n_\mathrm{farm} + n_\mathrm{rooms} } \]
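The three formulas above differ only in how many neighbors enter the denominator, so the computation parametrizes naturally over k. A sketch, with invented counts and the neighbor order taken from the GloVe example above:

```python
def semantic_frequency(counts, word, neighbor_list, k):
    """Relative frequency of `word` within its k-nearest-neighbor semantic space."""
    space = [word] + neighbor_list[:k]
    total = sum(counts.get(w, 0) for w in space)
    return counts.get(word, 0) / total if total else 0.0

# neighbor ranking as retrieved from a word vector model (cf. the GloVe slide)
house_neighbors = ["where", "place", "room", "town", "houses", "farm", "rooms"]
# hypothetical occurrence counts in some text
counts = {"house": 50, "where": 40, "place": 30, "room": 20,
          "town": 10, "houses": 5, "farm": 5, "rooms": 5}

print(semantic_frequency(counts, "house", house_neighbors, 2))  # 50 / 120
print(semantic_frequency(counts, "house", house_neighbors, 7))  # 50 / 165
```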
experimental setup
- a corpus of 99 novels in English
- by 33 authors (3 texts per author)
- tokenized and classified by the package
stylo
- stratified cross-validation scenario
- 100 cross-validation folds
- distance-based classification performed
- F1 scores reported
distance measures used
- classic Burrows’s Delta
- Cosine Delta (Würzburg)
- Eder’s Delta
- raw Manhattan distance
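As a hedged sketch (not the stylo implementation), the two headline measures follow their textbook definitions: Burrows's Delta is the mean absolute difference of z-scored MFW frequencies, and Cosine Delta applies cosine distance to the same z-scores:

```python
import math

def z_scores(freq_matrix):
    """Column-wise z-scores of relative frequencies, computed over the corpus."""
    n = len(freq_matrix)
    cols = list(zip(*freq_matrix))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in freq_matrix]

def burrows_delta(za, zb):
    """Classic Burrows's Delta: mean absolute difference of z-scores."""
    return sum(abs(a - b) for a, b in zip(za, zb)) / len(za)

def cosine_delta(za, zb):
    """Cosine Delta: cosine distance between z-scored frequency vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norms = math.sqrt(sum(a * a for a in za)) * math.sqrt(sum(b * b for b in zb))
    return 1 - dot / norms

# toy corpus: three texts, two MFW frequencies each (illustrative numbers)
z = z_scores([[5.1, 3.7], [5.6, 3.8], [4.4, 3.0]])
print(burrows_delta(z[0], z[1]), cosine_delta(z[0], z[1]))
```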
results for Cosine Delta

results for Burrows’s Delta

results for Eder’s Delta

results for Manhattan Distance

the best F1 scores
- Cosine Delta: 0.96
- Burrows’s Delta: 0.84
- Eder’s Delta: 0.83
- raw Manhattan: 0.77
how good are the results?
- we know that Cosine Delta outperforms Classic Delta etc.
- what is the actual gain in performance, then?
- an additional round of tests was performed to obtain a baseline
- the gain above the baseline reported below
gain for Cosine Delta

gain for Burrows’s Delta

gain for Eder’s Delta

gain for Manhattan Distance

conclusions
- in each scenario, the gain was
considerable
- the hot spot of performance varied depending on the method…
- … yet it was spread between 5 and 100 semantic neighbors
- best classifiers are even better: up to 12% improvement!
conclusions (cont.)
- the new method is very simple
- it doesn’t require any NLP tooling…
- … except getting a general list of n semantic neighbors for
MFWs
- such a list can be generated once and re-used several times
- if a rough method of tracing the words that matter was
already successful, a bigger gain can be expected with sophisticated
language models
alternative semantic spaces
- perhaps the n closest neighbors is not the best way to
define semantic spaces
- therefore: testing all words within the cosine distance of
x from the reference word
results for cosine similarities

gain for Cosine Delta

results for delta similarities

gain for Burrows’s Delta

results for Eder’s similarities

gain for Eder’s Delta

What is a distance?
take any two texts:
## the and to of a was I in he said you
## lewis_lion 5.141 3.699 2.295 2.185 2.100 1.346 0.813 1.162 1.087 1.426 1.141
## tolkien_lord1 5.624 3.782 2.074 2.597 1.916 1.313 1.492 1.419 1.221 0.825 0.872
subtract the values vertically:
## the and to of a was I in he said you
## -0.483 -0.083 0.221 -0.412 0.184 0.033 -0.679 -0.257 -0.134 0.601 0.269
then drop the minuses:
## the and to of a was I in he said you
## 0.483 0.083 0.221 0.412 0.184 0.033 0.679 0.257 0.134 0.601 0.269
sum up the obtained values:
## [1] 3.356
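The three steps above (subtract, drop the signs, sum) are the whole computation; a few lines of Python reproduce the value from the two frequency vectors shown:

```python
# relative frequencies (in %) of the 11 MFWs, copied from the table above
lewis   = [5.141, 3.699, 2.295, 2.185, 2.100, 1.346, 0.813, 1.162, 1.087, 1.426, 1.141]
tolkien = [5.624, 3.782, 2.074, 2.597, 1.916, 1.313, 1.492, 1.419, 1.221, 0.825, 0.872]

# subtract pairwise, drop the signs, sum up
manhattan = sum(abs(a - b) for a, b in zip(lewis, tolkien))
print(round(manhattan, 3))  # 3.356
```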
Manhattan vs. Euclidean
![]()
Euclidean distance
between any two texts represented by two points A and B in an
n-dimensional space can be defined as:
\[ \delta_{AB} = \sqrt{ \sum_{i = 1}^{n}
(A_i - B_i)^2 } \]
where A and B are the two documents to be compared,
and \(A_i,\, B_i\) are the scaled
(z-scored) frequencies of the i-th word in the range of
n most frequent words.
Manhattan distance
can be formalized as follows:
\[ \delta_{AB} = \sum_{i = 1}^{n} | A_i -
B_i | \]
which is equivalent to
\[ \delta_{AB} = \sqrt[1]{ \sum_{i =
1}^{n} | A_i - B_i |^1 } \]
(the above weird notation will soon become useful)
They are siblings!
\[ \delta_{AB} = \sqrt[2]{ \sum_{i =
1}^{n} (A_i - B_i)^2 } \]
vs.
\[ \delta_{AB} = \sqrt[1]{ \sum_{i =
1}^{n} | A_i - B_i |^1 } \]
For that reason, Manhattan and Euclidean are named L1 and L2,
respectively.
An (infinite) family of distances
- The above observations can be further generalized
- Both Manhattan and Euclidean belong to a family of (possible)
distances:
\[ \delta_{AB} = \sqrt[p]{ \sum_{i =
1}^{n} | A_i - B_i |^p } \]
where p is both the power and the degree of the root.
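Since p appears as both the power and the degree of the root, the whole family fits in one function; a minimal sketch, with invented example vectors:

```python
def lp_distance(a, b, p):
    """Minkowski L_p distance: the p-th root of the sum of |A_i - B_i|^p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 1.0]
print(lp_distance(a, b, 1))    # Manhattan (L1): 1 + 2 + 2 = 5.0
print(lp_distance(a, b, 2))    # Euclidean (L2): sqrt(1 + 4 + 4) = 3.0
print(lp_distance(a, b, 0.5))  # a p < 1 dissimilarity, formally not a norm
```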
The norms L1, L2, L3, …
- The power p doesn’t need to be a natural number
- We can easily imagine norms such as L1.01, L3.14159, L1¾, L\(\sqrt{2}\) etc.
- Mathematically, \(p < 1\)
doesn’t satisfy the formal definition of a norm…
- … yet still, one can easily imagine a dissimilarity L0.5 or
L0.0001.
- (plus, the so-called Cosine Distance doesn’t satisfy the definition
either).
To summarize…
- The p parameter is a continuum
- Both \(p = 1\) and \(p = 2\) (for Manhattan and Euclidean,
respectively) are but two specific points in this continuous space
- p is a method’s hyperparameter to be set or possibly
tuned
🧐 How do the norms from a wide range beyond L1 and L2 affect text
classification?
Data
Four full-text datasets used:
- 99 English novels by 33 authors,
- 99 Polish novels by 33 authors,
- 28 books by 8 American Southern authors:
- Harper Lee, Truman Capote, William Faulkner, Ellen Glasgow, Carson
McCullers, Flannery O’Connor, William Styron and Eudora Welty,
- 26 books by 5 fantasy authors:
- J.K. Rowling, Harlan Coben, C.S. Lewis, and J.R.R. Tolkien.
Method
- A supervised classification experiment was designed
- Aimed at authorship attribution
- leave-one-out cross-validation scenario
- 100 independent bootstrap iterations…
- … each of them involving 50% randomly selected input features (most
frequent words)
- The procedure repeated for the ranges of 100, 200, 300, …, 1000 most
frequent words.
- The whole experiment repeated iteratively for L0.1, L0.2, …,
L10.
- The performance in each iteration evaluated using accuracy, recall,
precision, and F1 scores.
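A minimal sketch of such an evaluation loop on toy data (the actual experiment uses stylo with z-scored MFW frequencies and bootstrapped feature sampling; here, just leave-one-out 1-NN attribution under an L_p distance):

```python
def lp_distance(a, b, p):
    """Minkowski L_p distance between two feature vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def loo_accuracy(vectors, labels, p):
    """Leave-one-out 1-NN attribution accuracy with an L_p distance."""
    hits = 0
    for i, v in enumerate(vectors):
        # attribute text i to the author of its nearest neighbor
        best = min((j for j in range(len(vectors)) if j != i),
                   key=lambda j: lp_distance(v, vectors[j], p))
        hits += labels[best] == labels[i]
    return hits / len(vectors)

# toy feature vectors (in the experiment: z-scored MFW frequencies)
vectors = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
labels  = ["A", "A", "B", "B"]
for p in (0.5, 1, 2):
    print(p, loo_accuracy(vectors, labels, p))
```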
99 English novels by 33 authors
![]()
99 Polish novels by 33 authors
![]()
28 novels by 8 Southern authors
![]()
26 novels by 5 fantasy writers
![]()
A few observations
- Metrics with lower \(p\) generally
outperform higher-order norms.
- Specifically, Manhattan is better than Euclidean…
- … but values \(p < 1\) are even
better.
- Feature vectors that yield the best results (here: long vectors of
most frequent words) are the most sensitive to the choice of the
distance measure.
Plausible explanations
- Small \(p\) rewards feature vectors that differ in fewer features
(rather than differing slightly across many features),
- Small \(p\) amplifies small differences, which matters for
low-frequency features: it distinguishes a zero difference (two texts
both lacking a feature) from a small one.
Therefore:
- Small \(p\) norms might be one way
of effectively utilizing long feature vectors.
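Both effects can be checked numerically. In the sketch below (invented vectors), two candidates sit at the same L1 distance from a target, but under L0.5 the one differing in a single feature is far closer than the one differing slightly everywhere:

```python
def lp(a, b, p):
    """Minkowski L_p distance between two feature vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

target     = [0.0] * 10
many_small = [0.1] * 10          # differs slightly in every feature
one_large  = [1.0] + [0.0] * 9   # differs strongly in a single feature

# equal under L1...
print(lp(target, many_small, 1), lp(target, one_large, 1))      # 1.0 vs 1.0
# ...but far apart under L0.5, which penalizes many differing features
print(lp(target, many_small, 0.5), lp(target, one_large, 0.5))  # ~10.0 vs 1.0
```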
\(L_p\) distances vs. Cosine Delta
English novels:
| MFWs | \(L_1\) | best \(L_p\) | Cosine Delta |
|------|---------|--------------|--------------|
| 100  | 0.625   | 0.625 (p = 0.9) | 0.666 |
| 500  | 0.814   | 0.823 (p = 0.6) | 0.865 |
| 1000 | 0.833   | 0.871 (p = 0.3) | 0.892 |
Polish novels:
| MFWs | \(L_1\) | best \(L_p\) | Cosine Delta |
|------|---------|--------------|--------------|
| 100  | 0.655   | 0.659 (p = 0.8) | 0.684 |
| 500  | 0.760   | 0.769 (p = 0.6) | 0.840 |
| 1000 | 0.751   | 0.835 (p = 0.1) | 0.842 |
99 English novels by 33 authors
![]()
99 Polish novels by 33 authors
![]()